Quick statistics-based, univariate audit of the columns
This US Census dataset contains detailed but anonymised information for approximately 300,000 people.
This data was extracted from the Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html
Donors: Terran Lane and Ronny Kohavi, Data Mining and Visualization. E-mail terran@ecn.purdue.edu or ronnyk@sgi.com with questions.
The data was split into train/test in approximately 2/3, 1/3 proportions using MineSet’s MIndUtil mineset-to-mlc.
The prediction task is to determine the income level of the person represented by each record. Incomes have been binned at the $50K level to present a binary classification problem, much like the original UCI/ADULT database. The goal field of this data, however, was drawn from the “total person income” field rather than the “adjusted gross income” and may therefore behave differently from the original ADULT goal field.
More information detailing the meaning of the attributes can be found in http://www.bls.census.gov/cps/cpsmain.htm To make use of the data descriptions at this site, the following mappings to the Census Bureau’s internal database column names will be needed:
| Data descriptions | Database column names |
|---|---|
| age | AAGE |
| class of worker | ACLSWKR |
| industry code | ADTIND |
| occupation code | ADTOCC |
| adjusted gross income | AGI |
| education | AHGA |
| wage per hour | AHRSPAY |
| enrolled in edu inst last wk | AHSCOL |
| marital status | AMARITL |
| major industry code | AMJIND |
| major occupation code | AMJOCC |
| race | ARACE |
| hispanic origin | AREORGN |
| sex | ASEX |
| member of a labor union | AUNMEM |
| reason for unemployment | AUNTYPE |
| full or part time employment stat | AWKSTAT |
| capital gains | CAPGAIN |
| capital losses | CAPLOSS |
| dividends from stocks | DIVVAL |
| federal income tax liability | FEDTAX |
| tax filer status | FILESTAT |
| region of previous residence | GRINREG |
| state of previous residence | GRINST |
| detailed household and family stat | HHDFMX |
| detailed household summary in household | HHDREL |
| instance weight | MARSUPWT |
| migration code-change in msa | MIGMTR1 |
| migration code-change in reg | MIGMTR3 |
| migration code-move within reg | MIGMTR4 |
| live in this house 1 year ago | MIGSAME |
| migration prev res in sunbelt | MIGSUN |
| num persons worked for employer | NOEMP |
| family members under 18 | PARENT |
| total person earnings | PEARNVAL |
| country of birth father | PEFNTVTY |
| country of birth mother | PEMNTVTY |
| country of birth self | PENATVTY |
| citizenship | PRCITSHP |
| total person income | PTOTVAL |
| own business or self employed | SEOTR |
| taxable income amount | TAXINC |
| fill inc questionnaire for veteran’s admin | VETQVA |
| veterans benefits | VETYN |
| weeks worked in year | WKSWORK |
Basic statistics for this data set:
Number of instances data = 199523
Duplicate or conflicting instances : 46716
Number of instances in test = 99762
Duplicate or conflicting instances : 20936
Class probabilities for income-projected.test file
Probability for the label ‘- 50000’ : 93.80%
Probability for the label ‘50000+’ : 6.20%
Majority accuracy: 93.80% on value - 50000
Number of attributes = 40 (continuous : 7 nominal : 33)
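The majority accuracy quoted above is simply the share of the most frequent class. A minimal sketch on an invented label vector (`toy.labels` is illustrative and not taken from the census files):

```r
# invented labels for illustration: 15 records below 50K, 1 above
toy.labels <- c(rep(" - 50000.", 15), " 50000+.")

# class probabilities, in the spirit of the statistics above
class.prob <- prop.table(table(toy.labels))

# a classifier that always predicts the most frequent class
# is right exactly as often as that class occurs
majority.class <- names(which.max(class.prob))
majority.accuracy <- max(class.prob)
```

Any model built later has to beat this baseline to be useful.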
# import the learning data set into a data frame
df <- read.csv('data/census_income_learn.csv', header = FALSE)
label <- df[[42]] == " 50000+."
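The label comparison needs a leading blank because the file separates fields with a comma followed by a space. As a possible alternative, sketched here on two invented records, `read.csv` accepts `strip.white = TRUE` to drop that blank. The rest of this audit keeps the blanks, so this is only an option and not what was run here.

```r
# two invented records in the same ", "-separated layout as the census file
toy.csv <- "73, Not in universe, - 50000.\n45, Private, 50000+."

# strip.white = TRUE removes the blank that follows each comma
toy.df <- read.csv(text = toy.csv, header = FALSE, strip.white = TRUE)

# the label no longer needs the leading blank in the comparison string
toy.label <- toy.df[[3]] == "50000+."
```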
We will now look at the columns one by one.
91 distinct values for attribute #0 (age) continuous
ages <- df[[1]]
summary(ages)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.00 33.00 34.49 50.00 90.00
plot(density(ages))
boxplot(ages, ages[label], ages[!label], horizontal = TRUE, names=c("global", "gt50000", "lt50000"))
We can see that the age distribution of people with an income greater than 50K differs from that of the others.
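The difference seen in the boxplot can also be put into numbers by summarising age within each income class. A small sketch on invented values (`toy.ages` and `toy.label` are made up for illustration):

```r
# invented ages and income labels, for illustration only
toy.ages  <- c(5, 17, 30, 42, 55, 61, 38, 47)
toy.label <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

# median age within each income class (FALSE = below 50K, TRUE = above)
age.median <- tapply(toy.ages, toy.label, median)
```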
9 distinct values for attribute #1 (class of worker) nominal
class of worker: Not in universe, Federal government, Local government, Never worked, Private, Self-employed-incorporated, Self-employed-not incorporated, State government, Without pay.
class.of.worker <- df[[2]]
par(mar=c(5.1, 13 ,4.1 ,2.1))
barplot(sort(prop.table(summary(class.of.worker[label]))),las=1, horiz = TRUE)
barplot(sort(prop.table(summary(class.of.worker[!label]))),las=1, horiz = TRUE)
The groups that stand out are Private, Self-employed-incorporated, Not in universe, and government (the three government classes merged).
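Merging the three government classes into a single level could look like the sketch below. The values are invented, and the merged level name `" Government"` is a hypothetical choice, not a value from the data set.

```r
# invented sample of the class-of-worker column (note the leading blanks)
toy.class <- factor(c(" Private", " Federal government", " Local government",
                      " State government", " Not in universe", " Private"))

# the three government classes listed above
gov.levels <- c(" Federal government", " Local government", " State government")

# replace them with one merged level (hypothetical name)
merged <- as.character(toy.class)
merged[merged %in% gov.levels] <- " Government"
merged <- factor(merged)
```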
52 distinct values for attribute #2 (detailed industry recode) nominal
detailed industry recode: 0, 40, 44, 2, 43, 47, 48, 1, 11, 19, 24, 25, 32, 33, 34, 35, 36, 37, 38, 39, 4, 42, 45, 5, 15, 16, 22, 29, 31, 50, 14, 17, 18, 28, 3, 30, 41, 46, 51, 12, 13, 21, 23, 26, 6, 7, 9, 49, 27, 8, 10, 20.
industry.code <- as.factor(df[[3]])
par(mar=c(5.1, 5 ,4.1 ,2.1))
barplot(prop.table(summary(industry.code[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(industry.code[!label])),las=1, horiz = TRUE)
Code 0 belongs mostly to the - 50000 class, so we can differentiate between 0 and non-0.
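The 0 versus non-0 split can be expressed as a logical indicator, and its share compared between the two income classes. A sketch on invented codes and labels:

```r
# invented industry codes and income labels, for illustration only
toy.industry <- as.factor(c(0, 40, 0, 33, 0, 4))
toy.label    <- c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# TRUE where the detailed industry recode is 0
industry.is.zero <- toy.industry == "0"

# share of code 0 within each income class
zero.share <- tapply(industry.is.zero, toy.label, mean)
```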
47 distinct values for attribute #3 (detailed occupation recode) nominal
detailed occupation recode: 0, 12, 31, 44, 19, 32, 10, 23, 26, 28, 29, 42, 40, 34, 14, 36, 38, 2, 20, 25, 37, 41, 27, 24, 30, 43, 33, 16, 45, 17, 35, 22, 18, 39, 3, 15, 13, 46, 8, 21, 9, 4, 6, 5, 1, 11, 7.
occupation.code <- as.factor(df[[4]])
par(mar=c(5.1, 5 ,4.1 ,2.1))
barplot(prop.table(summary(occupation.code[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(occupation.code[!label])),las=1, horiz = TRUE)
Same observation: the big difference is the proportion of 0.
17 distinct values for attribute #4 (education) nominal
education: Children, 7th and 8th grade, 9th grade, 10th grade, High school graduate, 11th grade, 12th grade no diploma, 5th or 6th grade, Less than 1st grade, Bachelors degree(BA AB BS), 1st 2nd 3rd or 4th grade, Some college but no degree, Masters degree(MA MS MEng MEd MSW MBA), Associates degree-occup /vocational, Associates degree-academic program, Doctorate degree(PhD EdD), Prof school degree (MD DDS DVM LLB JD).
education <- as.factor(df[[5]])
par(mar=c(5.1, 19,4.1 ,2.1))
barplot(prop.table(summary(education[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(education[!label])),las=1, horiz = TRUE)
There are two categories:
Up to high school: Children, 7th and 8th grade, 9th grade, 10th grade, High school graduate, 11th grade, 12th grade no diploma, 5th or 6th grade, Less than 1st grade, 1st 2nd 3rd or 4th grade
Post graduate: Bachelors degree(BA AB BS), Masters degree(MA MS MEng MEd MSW MBA), Associates degree-occup /vocational, Associates degree-academic program, Doctorate degree(PhD EdD), Prof school degree (MD DDS DVM LLB JD).
Some college but no degree appears in the same proportion in both income classes.
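The grouping described above can be sketched as a recoding of the education level. The group names below are hypothetical labels chosen for the example, and the input values are invented.

```r
# invented sample of the education column (note the leading blanks)
toy.education <- c(" Children", " 9th grade", " Bachelors degree(BA AB BS)",
                   " Some college but no degree", " Doctorate degree(PhD EdD)")

# levels placed in the degree group in the text above
degree.levels <- c(" Bachelors degree(BA AB BS)",
                   " Masters degree(MA MS MEng MEd MSW MBA)",
                   " Associates degree-occup /vocational",
                   " Associates degree-academic program",
                   " Doctorate degree(PhD EdD)",
                   " Prof school degree (MD DDS DVM LLB JD)")

# three-way recoding; everything not listed falls in the first group
education.group <- ifelse(toy.education %in% degree.levels, "degree",
                   ifelse(toy.education == " Some college but no degree",
                          "some college", "up to high school"))
```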
1240 distinct values for attribute #5 (wage per hour) continuous
wage per hour: continuous.
wage.per.hour <- df[[6]]
summary(wage.per.hour[label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 81.64 0.00 9999.00
summary(wage.per.hour[!label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 53.69 0.00 9916.00
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(wage.per.hour))
boxplot(wage.per.hour[label], wage.per.hour[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))
This variable should be combined with another one.
3 distinct values for attribute #6 (enroll in edu inst last wk) nominal
enroll in edu inst last wk: Not in universe, High school, College or university.
enroll.in.edu.inst.last.wk <- df[[7]]
par(mar=c(5.1, 10 ,4.1 ,2.1))
barplot(prop.table(summary(enroll.in.edu.inst.last.wk[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(enroll.in.edu.inst.last.wk[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first.
7 distinct values for attribute #7 (marital stat) nominal
marital stat: Never married, Married-civilian spouse present, Married-spouse absent, Separated, Divorced, Widowed, Married-A F spouse present.
marital.stat <- df[[8]]
par(mar=c(5.1, 13 ,4.1 ,2.1))
barplot(prop.table(summary(marital.stat[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(marital.stat[!label])),las=1, horiz = TRUE)
Never married, Married-civilian spouse present, others
24 distinct values for attribute #8 (major industry code) nominal
major industry code: Not in universe or children, Entertainment, Social services, Agriculture, Education, Public administration, Manufacturing-durable goods, Manufacturing-nondurable goods, Wholesale trade, Retail trade, Finance insurance and real estate, Private household services, Business and repair services, Personal services except private HH, Construction, Medical except hospital, Other professional services, Transportation, Utilities and sanitary services, Mining, Communications, Hospital services, Forestry and fisheries, Armed Forces.
major.industry.code <- df[[9]]
par(mar=c(5.1, 15 ,4.1 ,2.1))
barplot(prop.table(summary(major.industry.code[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(major.industry.code[!label])),las=1, horiz = TRUE)
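# compare major industry code "Not in universe or children" with industry code 0
# (`&&` only compares the first element of each vector and is an error for longer vectors in R >= 4.3,
# so the index is made explicit here; the TRUE below covers the first record only)
major.industry.code[1] == ' Not in universe or children' && industry.code[1] == 0
## [1] TRUE
This field appears to duplicate the industry code, although the check above covers only the first record.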
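Because `&&` inspects only the first element of each vector, a check over every record needs an element-wise comparison. A minimal sketch on invented values (the `toy.*` vectors are not census data):

```r
# invented major and detailed industry codes, for illustration only
toy.major  <- c(" Not in universe or children", " Education",
                " Not in universe or children", " Mining")
toy.detail <- as.factor(c(0, 43, 0, 2))

# element-wise indicators
niu  <- toy.major == " Not in universe or children"
zero <- toy.detail == "0"

# TRUE only when the two indicators agree on every row
same.rows <- all(niu == zero)

# cross-tabulation: the off-diagonal cells count the disagreements
agreement <- table(niu, zero)
```

Run against the real columns, a `FALSE` result or a non-zero off-diagonal cell would mean that the two fields are not exact duplicates.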
15 distinct values for attribute #9 (major occupation code) nominal
major occupation code: Not in universe, Professional specialty, Other service, Farming forestry and fishing, Sales, Adm support including clerical, Protective services, Handlers equip cleaners etc , Precision production craft & repair, Technicians and related support, Machine operators assmblrs & inspctrs, Transportation and material moving, Executive admin and managerial, Private household services, Armed Forces.
major.occupation.code <- df[[10]]
par(mar=c(5.1, 16 ,4.1 ,2.1))
barplot(prop.table(summary(major.occupation.code[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(major.occupation.code[!label])),las=1, horiz = TRUE)
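# compare major occupation code "Not in universe" with occupation code 0
# (`&&` only compares the first element of each vector and is an error for longer vectors in R >= 4.3,
# so the index is made explicit here; the TRUE below covers the first record only)
major.occupation.code[1] == ' Not in universe' && occupation.code[1] == 0
## [1] TRUE
# compare major occupation code "Not in universe" with weeks worked in year (first record only, as above)
major.occupation.code[1] == " Not in universe" && df[[40]][1] == 0
## [1] TRUE
We can create dummy variables for Sales, Professional specialty, and Executive admin and managerial.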
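Dummy variables for the three occupations named above could be built as in this sketch, which uses an invented sample of the column:

```r
# invented sample of the major occupation column (note the leading blanks)
toy.occupation <- c(" Sales", " Not in universe", " Professional specialty",
                    " Other service", " Executive admin and managerial")

# the three occupations singled out above
selected <- c(" Sales", " Professional specialty",
              " Executive admin and managerial")

# one 0/1 column per selected occupation; all other levels are 0 everywhere
dummies <- sapply(selected, function(level) as.integer(toy.occupation == level))
```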
5 distinct values for attribute #10 (race) nominal
race: White, Black, Other, Amer Indian Aleut or Eskimo, Asian or Pacific Islander.
race <- df[[11]]
par(mar=c(5.1,12,4.1,2.1))
barplot(prop.table(summary(race[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(race[!label])),las=1, horiz = TRUE)
We can divide it into 2 categories:
White, Asian or Pacific Islander.
Black, Other, Amer Indian Aleut or Eskimo.
10 distinct values for attribute #11 (hispanic origin) nominal
hispanic origin: Mexican (Mexicano), Mexican-American, Puerto Rican, Central or South American, All other, Other Spanish, Chicano, Cuban, Do not know, NA.
hispanic.origin <- df[[12]]
par(mar=c(5.1,12,4.1,2.1))
barplot(prop.table(summary(hispanic.origin[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(hispanic.origin[!label])),las=1, horiz = TRUE)
This variable can be merged with race, in the same group as Black, Other, and Amer Indian Aleut or Eskimo.
2 distinct values for attribute #12 (sex) nominal
sex: Female, Male.
sex <- df[[13]]
par(mar=c(5.1,5,4.1,2.1))
barplot(prop.table(summary(sex[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(sex[!label])),las=1, horiz = TRUE)
We can keep this variable with its 2 values.
3 distinct values for attribute #13 (member of a labor union) nominal
member of a labor union: Not in universe, No, Yes.
member.labor.union <- df[[14]]
par(mar=c(5.1,8,4.1,2.1))
barplot(prop.table(summary(member.labor.union[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(member.labor.union[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first.
6 distinct values for attribute #14 (reason for unemployment) nominal
reason for unemployment: Not in universe, Re-entrant, Job loser - on layoff, New entrant, Job leaver, Other job loser.
reason.unemployement <- df[[15]]
par(mar=c(5.1,10,4.1,2.1))
barplot(prop.table(summary(reason.unemployement[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(reason.unemployement[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first.
8 distinct values for attribute #15 (full or part time employment stat) nominal
full or part time employment stat: Children or Armed Forces, Full-time schedules, Unemployed part- time, Not in labor force, Unemployed full-time, PT for non-econ reasons usually FT, PT for econ reasons usually PT, PT for econ reasons usually FT.
full.part.employment.stat <- df[[16]]
par(mar=c(5.1,15,4.1,2.1))
barplot(prop.table(summary(full.part.employment.stat[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(full.part.employment.stat[!label])),las=1, horiz = TRUE)
We can divide it into 2 categories:
132 distinct values for attribute #16 (capital gains) continuous
capital gains: continuous.
capital.gains <- df[[17]]
summary(capital.gains[label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 4831 0 100000
summary(capital.gains[!label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 143.8 0.0 100000.0
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(capital.gains[label]))
plot(density(capital.gains[!label]))
boxplot(capital.gains[label], capital.gains[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))
We may combine this variable with others.
113 distinct values for attribute #17 (capital losses) continuous
capital losses: continuous.
capital.losses <- df[[18]]
summary(capital.losses[label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 193.1 0.0 3683.0
summary(capital.losses[!label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 27 0 4608
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(capital.losses[label]))
plot(density(capital.losses[!label]))
boxplot(capital.losses[label], capital.losses[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))
We may combine this variable with others.
1478 distinct values for attribute #18 (dividends from stocks) continuous
dividends from stocks: continuous.
dividends.stocks <- df[[19]]
summary(dividends.stocks[label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1553 363 100000
summary(dividends.stocks[!label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 107.8 0.0 39000.0
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(dividends.stocks[label]))
plot(density(dividends.stocks[!label]))
boxplot(dividends.stocks[label], dividends.stocks[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))
We may combine this variable with others.
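The text does not say which combination is intended for the capital-related variables. One possible option, offered purely as an assumption, is a net figure (gains minus losses plus dividends) together with a flag for any capital activity. A sketch on invented amounts:

```r
# invented amounts for four people, for illustration only
toy.gains     <- c(0, 5000, 0, 0)
toy.losses    <- c(0, 0, 1500, 0)
toy.dividends <- c(0, 200, 0, 50)

# assumed combination: net capital income
net.capital <- toy.gains - toy.losses + toy.dividends

# assumed combination: TRUE when any of the three amounts is non-zero
has.capital <- toy.gains > 0 | toy.losses > 0 | toy.dividends > 0
```

Since the three variables are 0 for most records, the flag may carry most of the signal, but that would need to be checked on the real data.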
6 distinct values for attribute #19 (tax filer stat) nominal
tax filer stat: Nonfiler, Joint one under 65 & one 65+, Joint both under 65, Single, Head of household, Joint both 65+.
tax.filer.stat <- df[[20]]
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary(tax.filer.stat[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(tax.filer.stat[!label])),las=1, horiz = TRUE)
2 categories:
Nonfiler
Joint both under 65
6 distinct values for attribute #20 (region of previous residence) nominal
region of previous residence: Not in universe, South, Northeast, West, Midwest, Abroad.
region.previous.residence <- df[[21]]
par(mar=c(5.1,8,4.1,2.1))
barplot(prop.table(summary(region.previous.residence[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(region.previous.residence[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first.
51 distinct values for attribute #21 (state of previous residence) nominal
state of previous residence: Not in universe, Utah, Michigan, North Carolina, North Dakota, Virginia, Vermont, Wyoming, West Virginia, Pennsylvania, Abroad, Oregon, California, Iowa, Florida, Arkansas, Texas, South Carolina, Arizona, Indiana, Tennessee, Maine, Alaska, Ohio, Montana, Nebraska, Mississippi, District of Columbia, Minnesota, Illinois, Kentucky, Delaware, Colorado, Maryland, Wisconsin, New Hampshire, Nevada, New York, Georgia, Oklahoma, New Mexico, South Dakota, Missouri, Kansas, Connecticut, Louisiana, Alabama, Massachusetts, Idaho, New Jersey.
state.previous.residence <- gsub("?",NA,df[[22]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(sort(prop.table(summary.factor(state.previous.residence[label])))[!names(sort(prop.table(summary.factor(state.previous.residence[label])))) %in% c("NA's", " Not in universe")],las=1, horiz = TRUE)
barplot(sort(prop.table(summary.factor(state.previous.residence[!label])))[!names(sort(prop.table(summary.factor(state.previous.residence[!label])))) %in% c("NA's", " Not in universe")],las=1, horiz = TRUE)
#ratio of Not in the universe and not applicable
prop.table(summary.factor(state.previous.residence[label]))[names(summary.factor(state.previous.residence[label])) %in% c("NA's", " Not in universe")]
## Not in universe NA's
## 0.950088839 0.003634308
prop.table(summary.factor(state.previous.residence[!label]))[names(summary.factor(state.previous.residence[!label])) %in% c("NA's", " Not in universe")]
## Not in universe NA's
## 0.919018280 0.003542783
This variable does not look very important; we may ignore it at first.
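The share of missing values per income class is computed several times in this audit with a long `prop.table(summary.factor(...))` expression. A shorter equivalent, sketched on invented values with a hypothetical helper named `na.share`:

```r
# invented sample of a column where "?" marks a missing value
toy.state <- c(" Not in universe", " ?", " Utah", " Not in universe", " ?")
toy.label <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# turn every entry containing "?" into NA
cleaned <- toy.state
cleaned[grepl("?", cleaned, fixed = TRUE)] <- NA

# hypothetical helper: fraction of missing entries in a vector
na.share <- function(x) mean(is.na(x))

share.gt <- na.share(cleaned[toy.label])    # income above 50K
share.lt <- na.share(cleaned[!toy.label])   # income below 50K
```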
38 distinct values for attribute #22 (detailed household and family stat) nominal
detailed household and family stat: Child <18 never marr not in subfamily, Other Rel <18 never marr child of subfamily RP, Other Rel <18 never marr not in subfamily, Grandchild <18 never marr child of subfamily RP, Grandchild <18 never marr not in subfamily, Secondary individual, In group quarters, Child under 18 of RP of unrel subfamily, RP of unrelated subfamily, Spouse of householder, Householder, Other Rel <18 never married RP of subfamily, Grandchild <18 never marr RP of subfamily, Child <18 never marr RP of subfamily, Child <18 ever marr not in subfamily, Other Rel <18 ever marr RP of subfamily, Child <18 ever marr RP of subfamily, Nonfamily householder, Child <18 spouse of subfamily RP, Other Rel <18 spouse of subfamily RP, Other Rel <18 ever marr not in subfamily, Grandchild <18 ever marr not in subfamily, Child 18+ never marr Not in a subfamily, Grandchild 18+ never marr not in subfamily, Child 18+ ever marr RP of subfamily, Other Rel 18+ never marr not in subfamily, Child 18+ never marr RP of subfamily, Other Rel 18+ ever marr RP of subfamily, Other Rel 18+ never marr RP of subfamily, Other Rel 18+ spouse of subfamily RP, Other Rel 18+ ever marr not in subfamily, Child 18+ ever marr Not in a subfamily, Grandchild 18+ ever marr not in subfamily, Child 18+ spouse of subfamily RP, Spouse of RP of unrelated subfamily, Grandchild 18+ ever marr RP of subfamily, Grandchild 18+ never marr RP of subfamily, Grandchild 18+ spouse of subfamily RP.
household.family.stat <- df[[23]]
par(mar=c(5.1,19,4.1,2.1))
barplot(prop.table(summary(household.family.stat[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(household.family.stat[!label])),las=1, horiz = TRUE)
2 categories:
householders
non householders
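The householder split could be coded as below. Which levels count as householders is an assumption of this sketch (here `" Householder"` and `" Nonfamily householder"`). Exact matching is used because a pattern search for "householder" would also catch `" Spouse of householder"`.

```r
# invented sample of the detailed household and family stat column
toy.household <- c(" Householder", " Spouse of householder",
                   " Nonfamily householder",
                   " Child 18+ never marr Not in a subfamily")

# assumption: only these two levels are treated as householders
householder.levels <- c(" Householder", " Nonfamily householder")

# TRUE for householders, FALSE for everyone else
is.householder <- toy.household %in% householder.levels
```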
8 distinct values for attribute #23 (detailed household summary in household) nominal
detailed household summary in household: Child under 18 never married, Other relative of householder, Nonrelative of householder, Spouse of householder, Householder, Child under 18 ever married, Group Quarters- Secondary individual, Child 18 or older.
household.summary.household <- df[[24]]
par(mar=c(5.1,16,4.1,2.1))
barplot(prop.table(summary(household.summary.household[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(household.summary.household[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first because it is redundant with the previous one.
instance weight: continuous; the dataset description marks it as "ignore".
instance.weight <- df[[25]]
summary(instance.weight[label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.82 1123.00 1684.00 1796.00 2241.00 8433.00
summary(instance.weight[!label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.87 1057.00 1613.00 1737.00 2186.00 18660.00
plot(density(instance.weight[label]))
plot(density(instance.weight[!label]))
boxplot(instance.weight[label], instance.weight[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))
This variable does not look very important; we may ignore it at first.
10 distinct values for attribute #24 (migration code-change in msa) nominal
migration code-change in msa: Not in universe, Nonmover, MSA to MSA, NonMSA to nonMSA, MSA to nonMSA, NonMSA to MSA, Abroad to MSA, Not identifiable, Abroad to nonMSA.
migration.code.msa <- gsub("?",NA,df[[26]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.code.msa[label])),las=1, horiz = TRUE)
barplot(prop.table(summary.factor(migration.code.msa[!label])),las=1, horiz = TRUE)
prop.table(summary.factor(migration.code.msa[label]))[names(prop.table(summary.factor(migration.code.msa[label]))) %in% c("NA's", " Nonmover")]
## Nonmover NA's
## 0.4216605 0.5284284
prop.table(summary.factor(migration.code.msa[!label]))[names(prop.table(summary.factor(migration.code.msa[!label]))) %in% c("NA's", " Nonmover")]
## Nonmover NA's
## 0.4131484 0.4977691
This variable does not look very important; we may ignore it at first.
9 distinct values for attribute #25 (migration code-change in reg) nominal
migration code-change in reg: Not in universe, Nonmover, Same county, Different county same state, Different state same division, Abroad, Different region, Different division same region.
migration.code.change.reg <- gsub("?",NA,df[[27]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.code.change.reg[label])),las=1, horiz = TRUE)
barplot(prop.table(summary.factor(migration.code.change.reg[!label])),las=1, horiz = TRUE)
#ratio of Nonmover and not applicable
prop.table(summary.factor(migration.code.change.reg[label]))[names(prop.table(summary.factor(migration.code.change.reg[label]))) %in% c("NA's", " Nonmover")]
## Nonmover NA's
## 0.4216605 0.5284284
prop.table(summary.factor(migration.code.change.reg[!label]))[names(prop.table(summary.factor(migration.code.change.reg[!label]))) %in% c("NA's", " Nonmover")]
## Nonmover NA's
## 0.4131484 0.4977691
This variable does not look very important; we may ignore it at first.
10 distinct values for attribute #26 (migration code-move within reg) nominal
migration code-move within reg: Not in universe, Nonmover, Same county, Different county same state, Different state in West, Abroad, Different state in Midwest, Different state in South, Different state in Northeast.
migration.code.move.reg <- gsub("?",NA,df[[28]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.code.move.reg[label])),las=1, horiz = TRUE)
barplot(prop.table(summary.factor(migration.code.move.reg[!label])),las=1, horiz = TRUE)
#ratio of Nonmover and not applicable
prop.table(summary.factor(migration.code.move.reg[label]))[names(prop.table(summary.factor(migration.code.move.reg[label]))) %in% c("NA's", " Nonmover")]
## Nonmover NA's
## 0.4216605 0.5284284
prop.table(summary.factor(migration.code.move.reg[!label]))[names(prop.table(summary.factor(migration.code.move.reg[!label]))) %in% c("NA's", " Nonmover")]
## Nonmover NA's
## 0.4131484 0.4977691
This variable does not look very important; we may ignore it at first.
3 distinct values for attribute #27 (live in this house 1 year ago) nominal
live in this house 1 year ago: Not in universe under 1 year old, Yes, No.
live.in.this.house.1y.ago <- df[[29]]
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary(live.in.this.house.1y.ago[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(live.in.this.house.1y.ago[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first.
4 distinct values for attribute #28 (migration prev res in sunbelt) nominal
migration prev res in sunbelt: Not in universe, Yes, No.
migration.in.sunbelt <- gsub("?",NA,df[[30]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.in.sunbelt[label]))[!names(prop.table(summary.factor(migration.in.sunbelt[label]))) %in% c("NA's")],las=1, horiz = TRUE)
barplot(prop.table(summary.factor(migration.in.sunbelt[!label]))[!names(prop.table(summary.factor(migration.in.sunbelt[!label]))) %in% c("NA's")],las=1, horiz = TRUE)
#ratio of not applicable
prop.table(summary.factor(migration.in.sunbelt[label]))[names(prop.table(summary.factor(migration.in.sunbelt[label]))) %in% c("NA's")]
## NA's
## 0.5284284
prop.table(summary.factor(migration.in.sunbelt[!label]))[names(prop.table(summary.factor(migration.in.sunbelt[!label]))) %in% c("NA's")]
## NA's
## 0.4977691
This variable does not look very important; we may ignore it at first.
7 distinct values for attribute #29 (num persons worked for employer) continuous
num persons worked for employer: continuous.
num.persons.worked.employer <- df[[31]]
summary(num.persons.worked.employer[label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 4.004 6.000 6.000
summary(num.persons.worked.employer[!label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.821 4.000 6.000
plot(density(num.persons.worked.employer[label]))
plot(density(num.persons.worked.employer[!label]))
boxplot(num.persons.worked.employer[label], num.persons.worked.employer[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))
We can clearly see that the distribution for people with an income greater than 50K differs from that of the others.
5 distinct values for attribute #30 (family members under 18) nominal
family members under 18: Both parents present, Neither parent present, Mother only present, Father only present, Not in universe.
family.members.under.18 <- as.factor(df[[32]])
par(mar=c(5.1,10,4.1,2.1))
barplot(prop.table(summary(family.members.under.18[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(family.members.under.18[!label])),las=1, horiz = TRUE)
This variable does not look very important and we may ignore it at first: it is redundant with age and some other variables, because people with an income greater than 50K are mostly above 18.
43 distinct values for attribute #31 (country of birth father) nominal
country of birth father: Mexico, United-States, Puerto-Rico, Dominican-Republic, Jamaica, Cuba, Portugal, Nicaragua, Peru, Ecuador, Guatemala, Philippines, Canada, Columbia, El-Salvador, Japan, England, Trinadad&Tobago, Honduras, Germany, Taiwan, Outlying-U S (Guam USVI etc), India, Vietnam, China, Hong Kong, Cambodia, France, Laos, Haiti, South Korea, Iran, Greece, Italy, Poland, Thailand, Yugoslavia, Holand-Netherlands, Ireland, Scotland, Hungary, Panama.
country.birth.father <- gsub("?",NA,df[[33]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(country.birth.father[label]))[!names(prop.table(summary.factor(country.birth.father[label]))) %in% c("NA's")],las=1, horiz = TRUE)
barplot(prop.table(summary.factor(country.birth.father[!label]))[!names(prop.table(summary.factor(country.birth.father[!label]))) %in% c("NA's")],las=1, horiz = TRUE)
#ratio of not applicable
prop.table(summary.factor(country.birth.father[label]))[names(prop.table(summary.factor(country.birth.father[label]))) %in% c("NA's")]
## NA's
## 0.04433856
prop.table(summary.factor(country.birth.father[!label]))[names(prop.table(summary.factor(country.birth.father[!label]))) %in% c("NA's")]
## NA's
## 0.03293773
This variable does not look very important; we may ignore it at first.
43 distinct values for attribute #32 (country of birth mother) nominal
country of birth mother: India, Mexico, United-States, Puerto-Rico, Dominican-Republic, England, Honduras, Peru, Guatemala, Columbia, El-Salvador, Philippines, France, Ecuador, Nicaragua, Cuba, Outlying-U S (Guam USVI etc), Jamaica, South Korea, China, Germany, Yugoslavia, Canada, Vietnam, Japan, Cambodia, Ireland, Laos, Haiti, Portugal, Taiwan, Holand-Netherlands, Greece, Italy, Poland, Thailand, Trinadad&Tobago, Hungary, Panama, Hong Kong, Scotland, Iran.
country.birth.mother <- gsub("?",NA,df[[34]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(country.birth.mother[label]))[!names(prop.table(summary.factor(country.birth.mother[label]))) %in% c("NA's")],las=1, horiz = TRUE)
barplot(prop.table(summary.factor(country.birth.mother[!label]))[!names(prop.table(summary.factor(country.birth.mother[!label]))) %in% c("NA's")],las=1, horiz = TRUE)
#ratio of not applicable
prop.table(summary.factor(country.birth.mother[label]))[names(prop.table(summary.factor(country.birth.mother[label]))) %in% c("NA's")]
## NA's
## 0.03787756
prop.table(summary.factor(country.birth.mother[!label]))[names(prop.table(summary.factor(country.birth.mother[!label]))) %in% c("NA's")]
## NA's
## 0.03019114
This variable does not look very important; we may ignore it at first.
43 distinct values for attribute #33 (country of birth self) nominal
country of birth self: United-States, Mexico, Puerto-Rico, Peru, Canada, South Korea, India, Japan, Haiti, El-Salvador, Dominican-Republic, Portugal, Columbia, England, Thailand, Cuba, Laos, Panama, China, Germany, Vietnam, Italy, Honduras, Outlying-U S (Guam USVI etc), Hungary, Philippines, Poland, Ecuador, Iran, Guatemala, Holand-Netherlands, Taiwan, Nicaragua, France, Jamaica, Scotland, Yugoslavia, Hong Kong, Trinadad&Tobago, Greece, Cambodia, Ireland.
# column 35 holds country of birth self; the original run read column 34 (country of birth mother)
country.birth.self <- gsub("?",NA,df[[35]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(country.birth.self[label]))[!names(prop.table(summary.factor(country.birth.self[label]))) %in% c("NA's")],las=1, horiz = TRUE)
barplot(prop.table(summary.factor(country.birth.self[!label]))[!names(prop.table(summary.factor(country.birth.self[!label]))) %in% c("NA's")],las=1, horiz = TRUE)
#ratio of not applicable
# (the figures below were produced from column 34 and are identical to those of country of birth mother; they need to be regenerated from column 35)
prop.table(summary.factor(country.birth.self[label]))[names(prop.table(summary.factor(country.birth.self[label]))) %in% c("NA's")]
## NA's
## 0.03787756
prop.table(summary.factor(country.birth.self[!label]))[names(prop.table(summary.factor(country.birth.self[!label]))) %in% c("NA's")]
## NA's
## 0.03019114
This variable does not look very important; we may ignore it at first. The only factor we can find is redundant with race and hispanic origin.
5 distinct values for attribute #34 (citizenship) nominal
citizenship: Native- Born in the United States, Foreign born- Not a citizen of U S , Native- Born in Puerto Rico or U S Outlying, Native- Born abroad of American Parent(s), Foreign born- U S citizen by naturalization.
citizenship <- do.call(rbind ,strsplit(as.character(df[[36]]), '-'))
native.citizen <- as.factor(citizenship[,1])
status.citizen <- as.factor(citizenship[,2])
par(mar=c(5.1,18,4.1,2.1))
barplot(prop.table(summary(native.citizen[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(native.citizen[!label])),las=1, horiz = TRUE)
barplot(prop.table(summary(status.citizen[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(status.citizen[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first.
3 distinct values for attribute #35 (own business or self employed) nominal
own business or self employed: 0, 2, 1.
own.business.or.self.employed <- as.factor(df[[37]])
par(mar=c(5.1,6,4.1,2.1))
barplot(prop.table(summary(own.business.or.self.employed[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(own.business.or.self.employed[!label])),las=1, horiz = TRUE)
This variable does not look very important; we may ignore it at first.
3 distinct values for attribute #36 (fill inc questionnaire for veteran’s admin) nominal
fill inc questionnaire for veteran’s admin: Not in universe, Yes, No.
fill.inc.questionnaire.for.veterant.admin <- as.factor(df[[38]])
par(mar=c(5.1,8,4.1,2.1))
barplot(prop.table(summary(fill.inc.questionnaire.for.veterant.admin[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(fill.inc.questionnaire.for.veterant.admin[!label])),las=1, horiz = TRUE)
#ratio of not applicable
prop.table(summary(fill.inc.questionnaire.for.veterant.admin[label]))
## No Not in universe Yes
## 0.017363915 0.981343886 0.001292198
prop.table(summary(fill.inc.questionnaire.for.veterant.admin[!label]))
## No Not in universe Yes
## 0.007363432 0.990632731 0.002003837
This can be treated as a boolean: yes / no.
3 distinct values for attribute #37 (veterans benefits) nominal
veterans benefits: 0, 2, 1.
veterants.benefits <- as.factor(df[[39]])
par(mar=c(5.1,4,4.1,2.1))
barplot(prop.table(summary(veterants.benefits[label])),las=1, horiz = TRUE)
barplot(prop.table(summary(veterants.benefits[!label])),las=1, horiz = TRUE)
#ratio of not applicable
prop.table(summary(veterants.benefits[label]))
## 0 1 2
## 0.00000000 0.01865611 0.98134389
prop.table(summary(veterants.benefits[!label]))
## 0 1 2
## 0.253333048 0.009367269 0.737299683
This can be treated as a boolean: 0 versus 1 or 2.
53 distinct values for attribute #38 (weeks worked in year) continuous
weeks worked in year: continuous.
weeks.worked.in.year <- df[[40]]
summary(weeks.worked.in.year[label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 52.00 52.00 48.07 52.00 52.00
summary(weeks.worked.in.year[!label])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 21.53 52.00 52.00
plot(density(weeks.worked.in.year[label]))
plot(density(weeks.worked.in.year[!label]))
boxplot(weeks.worked.in.year[label], weeks.worked.in.year[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))
We can clearly see that the distribution for people with an income greater than 50K differs from that of the others.
2 distinct values for attribute #39 (year) nominal
year: 94, 95.
year <- as.factor(df[[41]])
sort(prop.table(summary(year[label])))
## 94 95
## 0.4715716 0.5284284
sort(prop.table(summary(year[!label])))
## 95 94
## 0.4977691 0.5022309
This variable does not look very important; we may ignore it at first.
The columns selected for the model are:
Age
Class of worker
Industry code (0, !0)
Education
Wage per hour
Marital status
Major occupation code
Race boolean
Sex as a dummy variable " Male" / " Female"
Full or part time employment stat
Capital gains
Capital losses
Dividends from stocks
Tax filer stat
Detailed household and family stat as a dummy variable: " householders" or not
Num persons worked for employer
Fill inc questionnaire for veteran’s admin
Veterans benefits
Weeks worked in year
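The selection above can be turned into a data frame by picking the matching column positions, which follow the `df[[n]]` indices used throughout this audit. The sketch below runs on a placeholder frame with the same 42-column layout. The column names are the variable names used earlier, shortened in one case (`fill.inc.questionnaire`), and the recodings suggested above (dummies, grouped levels) are not applied here.

```r
# placeholder frame with the 42-column layout of the census file (dummy values)
toy.df <- as.data.frame(matrix(0, nrow = 3, ncol = 42))

# position of each selected column, named after the variables used in this audit
selected.columns <- c(age = 1, class.of.worker = 2, industry.code = 3,
                      education = 5, wage.per.hour = 6, marital.stat = 8,
                      major.occupation.code = 10, race = 11, sex = 13,
                      full.part.employment.stat = 16, capital.gains = 17,
                      capital.losses = 18, dividends.stocks = 19,
                      tax.filer.stat = 20, household.family.stat = 23,
                      num.persons.worked.employer = 31,
                      fill.inc.questionnaire = 38, veterans.benefits = 39,
                      weeks.worked.in.year = 40)

# keep only the selected columns and give them readable names
model.df <- toy.df[, selected.columns]
names(model.df) <- names(selected.columns)
```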